A Statistical Correction-Rejection Strategy for OCR Outputs in Persian Personal Information Forms
Abstract
In this paper, a maximum a posteriori (MAP) statistical modeling approach is used to correct and verify OCR outputs for Persian names and surnames. In addition, an efficient neural-network-based rejection method is presented and tested. Because of the large variety of Persian surnames, a statistical grammar is added to the MAP strategy to generate new surnames that are not included in the dictionary. The model is formulated analytically and implemented in practice. The results show a large reduction in character and word errors, while the added computation is negligible compared with the complexity of character recognition.
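As a rough illustration of dictionary-based MAP correction, the sketch below picks the dictionary word w that maximizes P(w) multiplied by the product of P(o_i | w_i) over the observed OCR characters o_i. It is not the paper's model: the toy dictionary `PRIORS`, the uniform-error channel in `char_likelihood`, and the transliterated names are all hypothetical. The posterior threshold `reject_below` is only a simple stand-in for the neural-network rejection described in the abstract, and the statistical grammar for unseen surnames is not modeled.

```python
import math

# Hypothetical toy dictionary with prior probabilities P(w); values are illustrative only.
PRIORS = {"reza": 0.5, "raza": 0.1, "sara": 0.4}


def char_likelihood(observed, true, p_correct=0.9, alphabet_size=26):
    """Assumed OCR channel P(observed char | true char): errors are uniform."""
    if observed == true:
        return p_correct
    return (1.0 - p_correct) / (alphabet_size - 1)


def map_correct(ocr_word, priors=PRIORS, reject_below=0.6):
    """Return (best_word, posterior); best_word is None when the output is rejected."""
    scores = {}
    for word, prior in priors.items():
        if len(word) != len(ocr_word):
            continue  # this sketch only handles substitution errors
        log_score = math.log(prior)
        for o, c in zip(ocr_word, word):
            log_score += math.log(char_likelihood(o, c))
        scores[word] = log_score
    if not scores:
        return None, 0.0
    # Normalize over the candidates to obtain the posterior P(w | ocr_word).
    max_log = max(scores.values())
    total = sum(math.exp(s - max_log) for s in scores.values())
    best = max(scores, key=scores.get)
    posterior = math.exp(scores[best] - max_log) / total
    if posterior < reject_below:
        return None, posterior  # threshold rejection, a stand-in for the paper's classifier
    return best, posterior
```

For example, `map_correct("rezo")` selects `"reza"` with high posterior, while raising `reject_below` causes ambiguous inputs such as `"rqza"` to be rejected instead of corrected.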
Similar resources
A new model for persian multi-part words edition based on statistical machine translation
Multi-part words in English are hyphenated, with the hyphen separating the different parts. Persian also has multi-part words. According to Persian morphology, a half-space character is needed to separate the parts of a multi-part word, but in many cases people incorrectly use a space character instead. This common incorrect use of the space leads to some s...
Retrieving Arabic Printed Document: a Survey
This paper surveys some of the literature pertaining to searching and retrieving OCR’ed printed documents with emphasis on Arabic documents. It examines peculiarities of Arabic morphology, orthography, retrieval, word clustering, display, OCR, and error correction. The paper surveys existing evaluation test-beds for retrieval of Arabic OCR texts. Lastly, it concludes with possible directions fo...
Linguistic Error Correction Of Japanese Sentences
This paper describes a newly developed linguistic error correction system, which can correct errors and rejections of Japanese sentences by using linguistic knowledge. Conventional optical character readers (OCR) need human assistance to correct their recognition errors and rejections. An operator must teach the OCR correct answers whenever an illegible character pattern occurs. If this error c...
Japanese OCR Error Correction using Character Shape Similarity and Statistical Language Model
We present a novel OCR error correction method for languages without word delimiters that have a large character set, such as Japanese and Chinese. It consists of a statistical OCR model, an approximate word matching method using character shape similarity, and a word segmentation algorithm using a statistical language model. By using a statistical OCR model and character shape similarity, the ...
Generating a Training Corpus for OCR Post-Correction Using Encoder-Decoder Model
In this paper we present a novel approach to the automatic correction of OCR-induced orthographic errors in a given text. While current systems depend heavily on large training corpora or external information, such as domain-specific lexicons or confidence scores from the OCR process, our system only requires a small amount of relatively clean training data from a representative corpus to learn...
Publication date: 2005